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Abstract. Under complete linkage disequilibrium (LD), robust tests 
often have greater power than Pearson's chi-square test and trend 
tests for the analysis of case-control genetic association studies. Robust 
statistics have been used in candidate-gene and genome-wide associa- 
tion studies (GWAS) when the genetic model is unknown. We consider 
here a more general incomplete LD model, and examine the impact of 
penetrances at the marker locus when the genetic models are defined 
at the disease locus. Robust statistics are then reviewed and their ef- 
ficiency and robustness are compared through simulations in GWAS 
of 300,000 markers under the incomplete LD model. Applications of 
several robust tests to the Wellcome Trust Case-Control Consortium 
[Nature 447 (2007) 661-678] are presented. 

Key words and phrases: Efficiency robustness, genetic models, genome- 
wide association studies, linkage disequilibrium, ranking and selection, 
incomplete LD model. 



1. INTRODUCTION 

Genome-wide association studies (GWAS) have 
been used to detect true associations between 100,000 
to 500,000 genetic markers (single-nucleotide poly- 
morphisms — SNPs) and common or complex dis- 
eases (e.g., Klein et al., 2005; Sladek et al., 2007; 
WTCCC, 2007). Currently, up to a million SNPs 
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are used in GWAS. A simple and initial analysis 
of GWAS is a genome-wide scan, in which a statis- 
tical test is applied to detect association one SNP 
at a time. Test statistics and/or their p- values are 
obtained for all SNPs and ranked in order of their 
statistical significance. After all SNPs are ranked, a 
prespecified small proportion of SNPs from the top- 
ranked SNPs (or SNPs with p- values less than a pre- 
specified genome- wide threshold level) is selected for 
further, more focused analyses, for example, haplo- 
type analysis, multi-marker analysis, fine mapping, 
imputation and independent replication studies (see 
Hoh and Ott, 2003; Marchini, Donnelly and Cardon, 
2005; Schaid et al., 2005). The genome-wide scan has 
also been shown to be cost-effective in two-stage de- 
signs for GWAS, in which additional subjects are 
genotyped in the second stage for a small portion 
of selected SNPs in the first stage (see Elston, Lin 
and Zheng, 2007; Thomas et al., 2009). We focus on 
robust tests for GWAS in the single stage designs. 

Since only a small portion of top-ranked SNPs is 
selected in genome-wide scans, it is important that 
the probability of at least one SNP with true asso- 
ciation being selected is high, for example, greater 
than 80% (Zaykin and Zhivotovsky, 2005; Gail et al., 
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2008). The probability that a SNP with true associ- 
ation is detected, confirmed and replicated in later 
more focused analyses is often smaller. Hence, one of 
the goals of genome-wide scans is to rank the SNPs 
with true associations as near to the top as pos- 
sible. Zaykin and Zhivotovsky (2005) showed that 
the factors that mainly affect the rankings of true 
SNPs include the total number of SNPs, the number 
of SNPs with true associations, the genetic effects 
(genotype relative risks or odds ratios), the sample 
size, power of the association test used, and linkage 
disequilibrium (LD) between SNPs and the func- 
tional locus (the true unknown disease locus). Most 
of the above factors are determined by the study 
design, except the power of the test for association. 
The common association tests include Pearson's chi- 
squared test (Pearson's test, for short), the Cochran- 
Armitage trend tests (CATTs) and the allelic test. 
Three CATTs are available depending on the under- 
lying genetic model (the mode of inheritance of the 
disease locus). Common genetic models include re- 
cessive, additive, multiplicative and dominant mod- 
els. Overdominant and underdominant models may 
also be used, but they are less common. The allelic 
test has performance similar to that of the CATT 
under the additive model when the Hardy- Weinberg 
equilibrium proportions hold (Sasieni, 1997; Guedj, 
Nuel and Prum, 2008). Thus, the allelic test is not 
considered here. 

Intuitively, the most powerful test should be used 
in genome- wide scans. For common and complex 
diseases, it is possible that there are multiple func- 
tional loci with different genetic models, in partic- 
ular, for GWAS. The power of an association test 
depends on the underlying genetic models of the 
functional loci, which, however, are unknown. They 
could be any of the four common genetic models or 
none of them. In addition, imperfect LD between 
functional and marker loci can modify the under- 
lying genetic model, further increasing uncertainty. 
In this case, there is no uniformly most powerful 
test for a genome- wide scan. It is known that the 
most efficient CATT is available when the genetic 
model is known (Sasieni, 1997; Freidlin et al., 2002). 
When the genetic model is unknown, using a single 
CATT is not robust across a family of genetics mod- 
els. Therefore, in this situation, more robust tests 
have been proposed for both candidate-gene studies 
and genome-wide scans (Freidlin et al., 2002; Sladek 
et al., 2007; Zheng and Ng, 2008; Gonzalez et al., 



2008; Joo et al., 2009). The performance of the ro- 
bust test statistics has been studied under the per- 
fect LD model, that is, the SNP is the same as the 
functional locus (see more discussion later). This is, 
however, a strong assumption for GWAS. In particu- 
lar, when one of the models embedded into a robust 
test holds at the functional locus, it remains unmod- 
ified at the marker locus. Therefore, it is not surpris- 
ing that robust tests based on the maximum of test 
statistics over common genetic models often provide 
greater power than Pearson's test and CATTs. How- 
ever, when LD is imperfect, the induced penetrance 
values at the marker are weighted averages of the 
causal penetrances, where the weights are functions 
of LD. Thus, the imperfect LD will change certain 
models, such as the dominant or the recessive mod- 
els, so that the heterozygote penetrance will have an 
intermediate value between those for the homozy- 
gotes. Therefore, it is important to investigate not 
only the exact form of such penetrance modifica- 
tions, but also its impact on the performance of the 
robust tests for association. 

In this article we consider a general LD model 
with the standardized LD parameter, D' (Lewontin, 
1964), and study the properties of the penetrances 
defined at the marker locus given the genetic model 
defined at the functional locus. In addition to re- 
viewing some common robust tests for case-control 
association studies, we also compare their perfor- 
mance under this general model with a varying D' . 
Using robust tests when there is imperfect LD has 
not been studied perviously. The perfect LD case, 
where the marker and the disease loci coincide, can 
be obtained as a special case at D' = 1 , with an ad- 
ditional requirement of equality of allele frequencies 
at the marker and the disease locus. This implies 
a perfect correlation between the alleles at the two 
loci. Under this general model, we also examine the 
effectiveness and robustness of the genetic model se- 
lection procedure (Zheng and Ng, 2008). Simulation 
studies are conducted to compare the efficiency ro- 
bustness of various robust tests under this general 
model for genome-wide scans of 300,000 SNPs. Ap- 
plications of robust tests are presented using real 
data from a GWAS (WTCCC, 2007). 

The rest of the article is organized as follows. In 
Section 2 we introduce notation, the case-control 
data and different genetic models. The 
Hardy-Weinberg disequilibrium coefficient and its 
use to detect the underlying genetic model is given 
in Section 3. Various robust tests for candidate-gene 
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analysis and GWAS will be reviewed under the per- 
fect LD model in Section 4. Section 5 presents nu- 
merical results based on the simulation studies. The 
performance of the model selection procedure under 
the general LD model will be reported. Compari- 
son of several robust tests in analyzing genome- wide 
data is also presented. Applications to real data are 
given in Section 6. Discussion and conclusions are 
given in the final section. 

2. GENETIC MODELS 

2.1 Notation and Data 

Consider a case-control association study with r 
cases and s controls and a SNP with alleles A and B. 
Denote the population frequencies of the alleles by 
Pr(£?) = p and Pr(^4) = p c = 1 — p. The three geno- 
types of the SNP are denoted by G?o = A A, G\ = 
AB, and G7 2 = BB, with the population frequen- 
cies Pr(Gj) = gi for i = 0,1,2. When the Hardy- 
Weinberg equilibrium (HWE) proportions hold in 
the population, (go,gi,g2) = (j?l,2pp c ,P 2 )- The case- 
control data for the SNP can be displayed in a 2 x 3 
contingency table with the rows corresponding to 
case or control groups and the columns to the three 
genotypes. The genotype counts for (G7o,Gi,G7 2 ) in 
cases and controls are denoted by (ro,ri,r 2 ) and 
(so, si, S2), respectively. The genotype counts follow 
multinomial distributions: (ro,ri,r 2 ) ~ Mul(r;po,Pi, 
p 2 ) and (s ,si,s 2 ) ~ Mul(s; g , 9i, 92), where pi = 
Pr(G?j|case) and qi = Pr(Gj| control) for i = 0,1,2. 
Under the null hypothesis of no association, Hq : pi = 
qi for all i. 

Denote the penetrance of the SNP by fi = Pr(case| 
Gi), and the disease prevalence by k = Pr(case). 
Then p % = gifi/k and ^ = ^(1 - /i)/(l - k). Hence, 
the null hypothesis becomes Hq ■ fo = fi = f 2 = k. 
For simplicity, we assume in this section there is 
only one functional locus. Therefore, there is only 
one genetic model. 

2.2 Perfect LD Model 

Under this model, the SNP is also the functional 
locus with equal allele frequencies. The penetrances 
fi, i = 0, 1, 2, defined earlier are also penetrances of 
the functional locus. Genotype relative risks (GRRs) 
are defined by Aj = fi/fo for i = 1,2, where fo is 
the reference penetrance. Under the alternative hy- 
pothesis, allele B is the risk allele if the probabil- 
ity of having the disease increases with the number 
of B alleles in the genotype. That is, / 2 > /1 > fo 



and f 2 > fo- These two constraints define a family 
of constrained genetic models, which contains four 
commonly used genetic models: 

(1) A = {(Ai,A 2 ):A 2 >Ai and A 2 > 1}. 

We refer to A as the constrained space for genetic 
models when the risk allele is known. The null hy- 
pothesis corresponds to Hq : Ai = A 2 = 1- The ge- 
netic model is recessive if Ai = 1, additive if Ai = 

1/2 

(1 + A 2 )/2, multiplicative if Ai = A 2 , and domi- 
nant if Ai = A 2 . Let A 2 = A for some A > 1. Then 
Ai can be calculated using A value under one of the 
four genetic models. The first three letters of each 
model are used to indicate the genetic model in the 
following, for example, REC stands for the recessive 
model. 

Note that A does not contain overdominant or un- 
derdominant models, which occurs when Ai > A 2 > 
1, Ai > 1 and A 2 > 1 > Ai, A 2 > Ai, respectively. These 
two models are less common compared to the other 
four genetic models reviewed here. 

2.3 Incomplete LD Model 

Under this model, the SNP of interest is not the 
functional locus. Suppose the functional locus also 
has two alleles, denoted by a and 6, with the popu- 
lation frequencies Pr(fe) = q and Pr(a) = q c = 1 — q. 
Assume that the SNP with alleles A and B is asso- 
ciated with the disease through LD with the func- 
tional locus with alleles a and b. Table 1 repre- 
sents the joint probabilities of the two loci, in which 
D = Pr(Aa) — Pr(^4) Pr(a) measures LD between the 
SNP and the functional locus. When D = 0, they are 
in linkage equilibrium. An association between the 
SNP and a disease can be established when \D\ > 
and when the two loci are linked. 

There are two commonly used measures of the re- 
lationship between the SNP and the functional lo- 
cus: D' and the correlation between the alleles A 
and a. Denote pAa = Pr(Aa), pAb = Pr(^46), ps a = 



Table 1 

Joint probabilities of the marker and functional locus under 
incomplete LD model 



Marker 


Functional locus 




a 


b 


A 


p c q c + D 


p c q - D 


Pc 


B 


pq c - D 


pq + D 


p 




Qc 


q 


1 
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Pi(Ba), and p Bb = Pi(Bb). Then D = pAaPBb ~ 
PAbPBa- The measure D' S [—1,1] of Lewontin (1964) 
is defined as 

if D > 0; 



D' 



mm(q c p,p c 
D 



if D< 0. 



min(g c p c ,pg)' 

When the SNP is identical to the functional locus 
(i.e., A = a, B = b and p = q), p B b = P, PAa = Pc, 
and pAb = PBa = 0. Thus, D' = 1. However, D' = 1 
can be reached when the SNP is not identical to the 
functional locus (e.g., when p^ q). The correlation 
between the two alleles is defined as (Weir, 1996) 



Corr(A, a) 



PAaPBb ~ PAbPBa 



y/PPcQQc 

Note that the correlation reaches its maximum value 
only when p = q. The LD model is complete if \D'\ = 
1 and perfect if | Corr(A, a)\ = 1. In this article we 
assume the two loci have the same allele frequencies. 
Thus, D' and the correlation are equivalent. That is, 
in this article the (im)perfect LD model is equivalent 
to the (in) complete LD model. 

In the simulations we specify D' , p and q. Then, D 
can be calculated. Using Table 1, the four haplotype 
frequencies pAa, PAb-, PBa and p B b can be obtained 
by replacing D in Table 1 by D' mm(q c p,p c q) when 
D > (a similar term is used when D < 0). 

The definition of a genetic model under the imper- 
fect LD model differs from that under the perfect LD 
model. Denote the genotypes at the functional locus 
by Gq = aa, G\ = ab and G 2 = bb. The penetrance 
of the functional locus is given by /* = Pr(case|G*) 
for i = 0,1,2. Accordingly, define GRRs by A* = 
/*//o for i = 1, 2. The penetrance of the SNP is the 



same as before and still denoted by /j. Denote f = 
(/o,/l,/2)*, f* = (/ * 5 /i* 5 /!)*> where * is transpose 
and P* = (Pr(ar|G i )) 3 x3 and P = (Pr(G i |G*)) 3 x3 
are 3x3 transition matrices. Then we have 



(2) 
(3) 



f = P**f*, 
f * = P*f . 



Under the perfect LD model, the two transition ma- 
trices are identity matrices P* = P = I. The con- 
ditional probabilities in (2) can be obtained using 
Pv(G*\G j ) = Pv(GlG j )/p =0 Pv(GlG j ) under the 
Hardy- Weinberg proportions at both SNP and func- 
tional locus, which are given in Table 2. Note that 
these are functions of the four haplotype frequencies. 
The conditional probabilities in (3) can be obtained 
similarly, and can also be found in Nielsen and Weir 
(1999) and Hanson et al. (2006), Table 3. 

2.4 Properties of Genetic Models under the 
Imperfect LD Model 

We defined genetic models using penetrances (/o, 
/11/2) at the SNP of interest. Under the imper- 
fect LD model, the genetic model should be defined 
at the functional locus using • Thus, the 

REC, ADD, MUL or DOM models correspond to 
A* = 1, A? = (X* 2 + l)/2, AJ = X* 2 1/2 , or A? = X* 2 , re- 
spectively. A constrained family of possible genetic 
models at the functional locus is given by 

(4) A* = {(Xl,X* 2 ):X* 2 >Xl andA^>l}. 

Note that A and A* are different under the imperfect 
LD model, and they are linked by the two transition 
matrices in (2) and (3). Under the imperfect LD 
model, applying Table 2 to f t = £ - =0 Pr(G^Gi)/;, 



Table 2 

Conditional probabilities in the transition matrix (2) 



Pr(GJ|Gi) 



Formula 



Pr(G5|G ): 
Pr(GS|Gi): 
Pr(G5|G 2 ): 
Pr(GJ|Go): 
Pr(Gi|Gi): 
Pr(G?|G 2 ) : 
Pr(G^|Go): 
Pr(G3|Gij = 
Pr(G 2 |G 2 ): 



Pr{aa\AA) 
Pr(aa\AB) 
Pr(aa\BB) 
Pi(ab\AA) 
: Pr(ab\AB) 
Pi(ab\BB) 
Pr(bb\AA) 
Pr(bb\AB) 
Pr(bb\BB) 



PAa/iPAa + %P 'AaP 'Ab + P A b) = F? 

PAaPBa/iPAaPBa + PAaPBb + PAbPBa + PAbPBb) = Fl F 2 
PBa/{PBa + ZpBaPBb + P B b) = *2 
2p A aPAb/{PAa + ^P AaP Ab + P A b) = 2-Fl-F]i 

(jpAaPBb + PAbPBa) /(pAaPBa + PAaPBb + PAbPBa + PAbPBb) 
ZpBaPBb/ (PBa + ^PBaPBb + P%b) = 2F 2 F 4 
P 2 Abl\p 2 Ab + 2 P AaPAb + PAb) = F? 



= FiF 4 + F 2 F 3 



PAbPBb/ {PAaPBa + PAaPBb + PAbPBa + PAbPBb) 
PBbl (PBa + 2p B aPBb + PBb) = F l 



= F 3 F 4 



Fx = ( Pc q c + D)/ Pc , F 2 = (pq c - D)/p, F 3 = (p c q - D)/p c , F 4 = (pq + D)/p 
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we have 



(5) 
(6) 



fo 
fr 



f *(F 2 + 2F 1 F 3 X* 1 + F 2 X* 2 ), 

f£{FiF 2 + (F1F4 + F 2 F 3 )Xl + -F3-F4A2}, 



(7) h = fo (F% + 2F 2 F A X\ + F 2 X* 2 ) . 

The true disease model at the functional locus, de- 
fined using (A^A^), is unknown. We study proper- 
ties of the penetrances (fo,fi,f2) or GRRs (Ai,A2) 
defined at the SNP given (A|,A|). 

Theorem 2.1. Under the imperfect LD model 
with \D'\ < 1, if (X^, X 2 ) € A* at the functional locus, 
then (Ai, A2) € A at the marker locus. Moreover, for 
(A|,A|) € A* - {(1,1)}, ifX\ = l (orX\ = X* 2 ), then 
Ai > 1 (or X 2 > Xi). 



Proof. Using F 2 — F\ 
and (5) to (7), we obtain 

foD 



-D/( PPc 



(8) /1-/0 

(9) /2-/1 



PPc 
PPc 



{F 1 (X* 1 -l) + F 3 (Xl 
{F 2 (Xl-l) + F A (X* 2 



-(F4-F3 

At)}, 
At)}- 



It follows that f% > f\ > /o and f 2 > /o when > 
f'l > fo and f 2 > /q . The proof of the second claim 
is trivial using the above two expressions and that, 
from Table 1, all F{, i = 1,2,3,4, are positive. □ 

Theorem 2.1 shows that when the GRRs are con- 
strained in A* at the functional locus, they are also 
constrained to a subset of A at the SNP when \D'\ < 
1. In addition, when the true disease model is either 



REC or DOM at the functional locus, it is no longer 
REC or DOM at the SNP, respectively. They are 
"closer" to the ADD/MUL models. The implication 
of this finding is that one will not see a pure DOM 
or REC model at the marker locus if the constrained 
model space A* is considered at the functional locus. 
It also provides a rationale for the genetic model se- 
lection approach (Zheng and Ng, 2008) in that an 
ADD/MUL is always chosen unless there is strong 
evidence to indicate the REC or DOM models. 

Even though the REC (or DOM) model at the 
functional locus is no longer retained at the SNP 
when \D'\ < 1, the ADD (or MUL) model is re- 
tained. Dividing (8) and (9) by fo, we obtain 

foD 2 



(10) 2Ai - 1 - A 2 



foP 2 P 2 



(2X1-1- X*). 



Using (5) to (7) to expand A2 



A 2 



and (F 2 F 3 
(11) 



A2 — Ai 



--D 2 /(p 2 p 2 c ) 

f*2 n2 

(A* 

foP 2 P 2 c 2 



1 - (/2/0 - /l)//o 
we obtain 



Af) 



The above two equations lead directly to the follow- 
ing result. 

Theorem 2.2. Under the imperfect LD model 
with \D'\ < 1, when the genetic model is ADD (X\ = 
(1 + X* 2 )/2) or MUL (X* 2 = Xf) at the functional lo- 
cus, the same model is retained at the marker locus. 

Figure 1 displays the mapping of genetic models 
from A* to A under the imperfect LD model. If we 
still define a genetic model at the marker locus un- 
der the imperfect LD model, then, using (3) and a 



GRR space A" 



A*2 



MU L> ADD 
REC ." / 1 



it? 



DOM 



GRR space A 



hi 



f=rf* 




r 



MUL 



ADD 



REC 



/ 

: /// 



DOM 



V 



if 

f 

7 



Ai 



Fig. 1. Plots of the GRR spaces A* and X under the inperfect LD model. 
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table similar to Table 2, the REC or DOM models 
at the marker locus would correspond to the under- 
dominant or overdominant models at the functional 
locus, respectively. 

3. THE HARDY-WEI N BERG 
DISEQUILIBRIUM COEFFICIENT AND 
GENETIC MODEL SELECTION 

The Hardy-Weinberg disequilibrium (HWD) co- 
efficient in cases or between cases and controls has 
been used to detect association (Nielsen, Ehm and 
Weir, 1998; Zaykin and Nielsen, 2000; Song and El- 
ston, 2006). In addition, it can also be used to de- 
tect the underlying genetic model at the marker lo- 
cus (Wittke-Thompson, Pluzhnikov and Cox, 2005; 
Zheng and Ng, 2008). In this section we first review 
the HWD coefficient and how it can be used to de- 
tect the genetic model at the SNP of interest. Then 
we study whether it can still be used to detect the 
genetic model which is defined at the functional lo- 
cus under the imperfect LD model. 

Using the notation in Section 1, the HWD coeffi- 
cient at the SNP with alleles A and B is given by 
(Weir, 1996) 

A = Pr(AA) - {Pr(AA) + Pr(AB)/2} 2 

= S2-G?2+ffi/2) 2 . 

In cases and controls, it is denoted by Ai and Ao, 
respectively, and given by 

Ai =P2 ~ (P2 +pi/2) 2 and 
A = q 2 -(q2 + qi/2) 2 . 

Substituting pi = gifi/k and qi = gi{l - fi)/{l - k) 
under the Hardy-Weinberg proportions (A = 0), one 
has (Wittke-Thompson, Pluzhnikov and Cox, 2005; 
Zheng and Ng, 2008) 



(12) A 1 

(13) A 



foP 2 Pc 
k 2 

foP 2 P 2 c 



(A2-A 2 ), 



(2Ai - 1 - A 2 - f Xl + / A 2 ) 



u (l-k) 2 

Using the signs of (Ai,Ao), Zheng and Ng (2008) 
divided A in (1) into four mutually exclusive re- 
gions R\ to R4. The signs in the four regions are 
(Ai,A ) = (+,-) in R x , (-,-) mR 2 , (-,-) in R 3 , 
and (— ,+) in R4. The REC model belongs to Ri 
and the DOM model belongs to R4. The region R 2 is 
bounded by the ADD and MUL models (see Figure 1 
of Zheng and Ng, 2008). Therefore, under the REC 



model (defined at the SNP with Ai = 1), A x > 
and Ao < 0, and under the DOM model, Ai < and 
A > 0. Zheng and Ng (2008) used <9A = Ai - A as 
a genetic model indicator. The REC model implies 
that <9A > 0, while the DOM model implies OA < 0. 
A normalized test statistic based on d A = A± — Aq, 
where Pi = ri/r and % = Si/s, is given 



^HWDTT 



(rs/nf^dA 



{1 - n 2 /n - ni/(2n)}{n 2 /n + ni/(2n)} 
iV(0,l) 



under Hq and referred to as the HWD trend test 
(HWDTT) (Song and Elston, 2006). It is used to 
select a genetic model (Zheng and Ng, 2008). Given 
that B is the risk allele, the ADD (or MUL) model 
is chosen unless there is strong evidence to indicate 
a REC model or a DOM model. When Zhwdtt > 
1.645, the REC model is selected; when Zhwdtt < 
— 1.645, the DOM model is selected. 

Under the imperfect LD model, using (11) and 
(10), (12) and (13) can be written as 



A 1 



fo 2 D 2 
k 2 

(1-k) 2 



(XI 



Af) 



(2a; - 1 - a^ 



/ *Af + / *A^) 



Comparing the above (Ai,Ao) with (12) and (13), 
we see that the signs of (Ai, Ao) do not change when 
the genetic model is defined at the functional locus. 
Hence, the model selection procedure of Zheng and 
Ng (2008) can still be used. 

4. ROBUST TESTS 

4.1 Pearson's Test and CATTs 

Given the case-control data for a single SNP, (ro, 
ri,r 2 ) and (sq,si,S2), denote Hi = Ti + S{ for i = 
0, 1, 2 and n = uq + ri\ + n 2 . Pearson's test can be 
written as 

2 

T x 2 = yj^Tj - nir/n) 2 /(n.ir/n) 
i=0 



+ X^ S * -nis/n) 2 /(nis/n), 



i=0 



which asymptotically follows a chi-squared distribu- 
tion with 2 degrees of freedom (df) under Hq. The 
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CATT with a score x G [0, 1] is given by 




/ [rs{n( ni + 4n 2 ) - (n x + 2n 2 ) 2 }] 1/2 , 

where (xo^i)^) = (0,x, 1). Under f/~0j ^ asymp- 
totically follows the standard normal distribution 
iV(0, 1) for a given x. Optimal scores for REC, 
ADD/MUL and DOM models are x = 0, 1/2 and 1. 

When the genetic model is unknown, is often 
used. There is a trade-off between T x 2 and Z x with 
x = 1/2. Pearson's test is more robust but less pow- 
erful, in particular, under the ADD or DOM models, 
while the trend test is more powerful under the ADD 
or DOM models but less robust when the score x is 
misspecified. Pearson's test is identical to the trend 
test Z% with x = (ri/ni — ro/no)/(si/ni — sq/uq) 
(Yamada and Okada, 2009; Zheng, Joo and Yang, 
2009). In practice, however, x is prespecified. Thus, 
this condition is rarely satisfied. 

4.2 MAX 

To avoid the trade-off between Pearson's test and 
the CATT, one approach is to consider maximum 
tests. A typical maximum test is given by (Freidlin 
et al., 2002; Sladek et al., 2007) 

MAX 3 = max{|Z |,|Z 1/2 |,|Z 1 |}. 

Other versions of maximum tests are also used, for 
example, MAX = sup^o.i] \%x\ (Davies, 1977, 1987), 
the maximum of three likelihood ratio tests under 
various genetic models (Gonzalez et al., 2008), and 
for a quantitative trait (Lettre, Lange and Hirschhorn, 
2007). 

Computational aspects of maximum tests have been 
discussed by Conneely and Boehnke (2007) and Li et 
al. (2008a). The empirical distribution of MAX3 can 
be obtained from simulation using the joint multi- 
variate normal distribution of the CATTs consider- 
ing asymptotic null correlations among them (Frei- 
dlin et al., 2002) or from a parametric bootstrap pro- 
cedure by generating data using (ro,?r,r2) ~ 
Mul(r;p ,Pi,P2) and (s , si, s 2 ) ~ Mul(s;p ,pi,p 2 ), 
where pi = rii/n. A simpler algorithm to find the 
asymptotic and empirical null distributions of MAX3 
is recently proposed (Zang, Fung and Zheng, 2010). 
The asymptotic null distribution of MAX3 is a func- 
tion of the minor allele frequency (MAF) of the SNP. 
In a genome-wide scan to rank a large number of 



SNPs, Li et al. (2008b) demonstrated that ranking 
can be done easily by the values of MAX3 rather 
than by their p-values. Hence, there is no need to 
calculate the p- values of MAX3 , even though the p- 
values of MAX3 are more comparable across SNPs. 

4.3 MIN2 

An alternative approach used by WTCCC (2007) 
utilizes both Pearson's test and the CATT Zi/ 2 . 
WTCCC (2007) proposed to use the minimum of 
the p- values of T x 2 and Z\i 2 to scan all the SNPs. 
SNPs with the minimum p- value less than a thresh- 
old level were retained for further analyses. Joo et 
al. (2009) denoted the minimum of the two p- values 
by 

MIN2 = min{pT 2 ,p Zl/2 } 

and obtained its asymptotic null distribution and its 
p- value, denoted by PMIN2- The key formula to find 
the distribution and p- value for MIN2 is the joint 
distribution of Pearson's test and Zx/2 under Hq, 
which is given by (Joo et al., 2009) 

Pr(Z 1 2 /2 <t 1 ,T x2 <t 2 ) 

= 1 _ I e -V2 _ i/ 2e -W2 
2 ' 

H / e~ v ' 2 arcsinl 1 I dv, 

2vr J tl \ v J 

when t\ < t 2 , and Pr(Z 2 ^ 2 < ti,T x 2 < t 2 ) = 1 — 
exp(— 1 2 /2) when t\ > t 2 . Unlike MAX3, the asymp- 
totic null distribution of MIN2 does not depend on 
the MAFs of SNPs. Hence, MIN2 itself can be used 
to rank all SNPs, which results in the same ranks as 
when the p-value of MIN2 is used. Joo et al. (2009) 
demonstrated that PMIN2 > MIN2, because Z^ 2 and 
Ty2 are correlated under the alternative hypothesis. 
Thus, MIN2 itself cannot be used as the p-value. 

4.4 The Genetic Model Selection (GMS) 
Procedure 

The GMS procedure is an adaptive approach. It 
contains two phases. In phase 1 the underlying ge- 
netic model is detected using the value and sign of 
^hwdtt (Song and Elston, 2006; see also Section 
3). Once the model is selected (REC, ADD/MUL 
or DOM), in the second phase, the CATT optimal 
for the selected model is applied to test for asso- 
ciation. For example, if the REC model is selected 
using the HWDTT, Zq would be used in phase 2 to 



8 



G. ZHENG ET AL. 



test for association. Since the analyses in the two 
phases are correlated, Zheng and Ng (2008) derived 
the asymptotic null correlation for the GMS. This 
correlation is incorporated in the distribution of the 
test statistics to control for the Type I error. Like 
MIN2, computing the p- value of the GMS requires 
integrations. Like MAX3, the GMS can be used to 
rank SNPs (Zheng et al., 2009). Using test statistics 
to directly rank SNPs is easier than using p-values 
of the GMS. Since the GMS depends on which allele 
is the risk allele or whether the minor allele is the 
risk allele, for each SNP, we first determine the risk 
allele (B is risk allele if Z\ji > 0). If the risk allele is 
B, then the above GMS can be applied. Otherwise, 
we can switch the two alleles and apply the above 
GMS. 

4.5 Other Tests 

Balding (2006) provided an excellent review of 
statistical methods for the analysis of association 
studies. Two other robust two-phase tests are also 
available that we do not include here. One feature 
of these methods is that the test statistics in two 
phases are asymptotically independent under Hq 
(Zheng, Song and Elston, 2007, Zheng et al., 2008). 
In this case, the second phase can be used as a "self- 
replication," an idea proposed in van Steen et al. 
(2005). Alternatively, the significance level a can be 
decomposed to (01,02) such that 0102 = o, where 
oi is used for the phase 1 analysis and 02 for the 
phase 2 analysis. The null hypothesis is rejected 
when analyses in both phases are significant at their 
corresponding levels. Choices of a\ and «2 with a±a = 
a in GWAS were discussed in Zheng, Song and El- 
ston (2007), Zheng et al. (2008). Another robust 
test is the constrained likelihood ratio test (LRT) 
(Wang and Sheffield, 2005). It is similar to the LRT 
except that the alternative space is restricted to 
A — {(1,1)}. The performance of the constrained 
LRT is similar to that of MAX3 described above. 
Thus, we only consider MAX3 here. 

4.6 Why Robust Tests? 

One of the reasons that we use robust tests in 
GWAS is that there might be multiple functional 
loci for a given disease. The modes of inheritance or 
genetic models may differ from one functional locus 
to the other. Another reason for using robust tests 
is the distortion of the actual genetic model at the 
marker locus due to incomplete LD, which further 
amplifies uncertainty about the model. Thus, robust 



tests are generally preferred. We use efficiency ro- 
bustness to measure robustness (Gastwirth, 1985). 
A test T\ is said to have greater efficiency robust- 
ness than a test T2 if the worst asymptotic relative 
efficiency of T\ to the asymptotically optimal test 
across all genetic models is higher than the worst 
asymptotic relative efficiency of T2 . The C ATT Z\ m 
optimal for the ADD model is most robust among all 
trend tests when the genetic models are constrained 
in A. Pearson's test is also robust because it does not 
require the genetic models to be constrained or the 
alternative hypothesis to be ordered. When restrict- 
ing to A, tests more robust than Z1/2 are available. 
MAX3 and GMS are two examples. They both have 
greater efficiency robustness than Pearson's test and 
Zx/2 (Preidlin et al., 2002; Zheng and Ng, 2008). 
On the other hand, combining information of both 
Pearson's test and Z1/2, MIN2 is also more efficiency 
robust than either Pearson's test or Zyi^- Three ro- 
bust tests, MAX 3 , GMS and MIN2, appear to have 
comparable efficiency robustness in candidate-gene 
studies (Joo et al., 2009). 

In genome-wide scans it is desirable to locate the 
SNPs representing true association as near the top 
as possible, where all SNPs compete for the top 
ranks. Under the complete LD model, Zheng et al. 
(2009) conducted simulation studies comparing the 
three robust methods in ranking 300,000 SNPs, 
among which there were 6 functional loci with dif- 
ferent genetic models, MAFs and GRRs (from 1.25 
to 1.5). The results showed that the GMS slightly 
outperforms MIN2 and MAX 3 when the top 5000 
SNPs were selected. The criteria used for compari- 
son included the probability that the top 5000 SNPs 
contained at least one SNP with true association, as 
well as the minimum and average ranks of SNPs 
with true associations among the top 5000 SNPs. 
We will conduct similar simulation studies in Sec- 
tion 5 under the inperfect LD model. The reason 
that we choose the top 5000 SNPs rather than a 
smaller number, say, the top 100, is that the SNPs 
with true association are not always ranked near 
the top, especially for a small GRR between 1.2 and 
1.5 and small sample sizes (Zaykin and Zhivotovsky, 
2005). If we examine the top 100 list with 250 cases 
and 250 controls (the sample sizes that we used in 
our simulation studies), the probability that the list 
of the top 100 SNPs contains a true association is 
less than 0.50. 
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5. SIMULATION STUDIES 

5.1 The GMS Procedure under the Imperfect 
LD Model 

We first conducted simulation studies to estimate 
the distribution of genetic models selected by the 
GMS. We chose disease prevalence k = 0.1 and GRR 
A2 = 2 at the functional locus. Then A^ was obtained 
using A2 and a given genetic model at the functional 
locus. We considered 0.1, 0.3 and 0.5 for the equal 
MAFs at a SNP (p) and a functional locus (q). This 
allows us to compare the frequencies of the different 
models selected when D' = 1.0, 0.8 and 0.6. With 
equal allele frequencies p = q, Corr(^4,a) = D' . In 
each of 10,000 replicates, 250 cases and 250 con- 
trols were simulated from multinomial distributions 
in which the penetrances at a SNP were calculated 
using (5) to (7). When the GMS did not select REC 
or DOM, the ADD or MUL models are used and de- 
noted here by A/M. Results are reported in Table 
3. 

When the true model is REC or DOM at the 
functional locus, the frequencies that the model se- 
lected by the GMS at the marker locus is REC 
or DOM decreases dramatically when D' becomes 
small. For example, when p = q = 0.3, the frequency 
of selecting REC at the marker locus is about 67.5% 
when the true model at the functional locus is REC, 
and D' = 1. This frequency declines to 18.6% when 
D' = 0.6. These frequencies, however, are not sensi- 
tive when the true model at the functional locus 
is either ADD or MUL. The findings are consis- 
tent with Theorems 2.1 and 2.2. Given the genetic 
model space A* at the functional locus, the genetic 
model space at the marker locus A is shifted to- 
ward the center of the space A* corresponding to 
the ADD/MUL models. 

Table 4 reported the GRRs at the marker locus 
given those at the functional locus. Note that when 



Table 4 

GRRs (Ai, A2) at a SNP given GRR AJ5 = 2 at the functional 
locus: p = q = 0.3. When D' = 1, A* = Ai for i=l,2 



True 
model 




£>7(Ai,A 2 ) 




1.0 


0.8 


0.6 


REC 
ADD 
MUL 
DOM 


(1.00, 2.00) 
(1.50, 2.00) 
(1.41, 2.00) 
(2.00, 2.00) 


(1.05, 1.73) 
(1.38, 1.75) 
(1.22, 1.48) 
(1.67, 1.77) 


(1.07, 1.50) 
(1.27, 1.54) 
(1.24, 1.53) 
(1.43, 1.57) 



the true model is ADD (AJ = (1 + X* 2 )/2) or MUL 
(AJ 2 = A2), the GRRs at the marker locus follow 
the same models. However, Aj are smaller than A*. 
Similar patterns are observed when the true model 
is REC or DOM, except that Ai is slightly greater 
than XI under the REC model. 

5.2 Comparison of Robust Tests in GWAS under 
the Imperfect LD Model 

In Table 3 when the true model is REC or DOM 
at the functional locus, the GMS could not select 
REC or DOM at the marker locus. This, however, 
does not mean that the GMS cannot improve power 
or chances of true discoveries when | Corr(A,a)| < 1. 
On the contrary, owing to the shrinkage of the ge- 
netic model space and that the GMS only selects a 
model at the marker locus, it can be viewed as select- 
ing an appropriately induced model at the marker 
locus. Our next simulation will examine the per- 
formance of robust tests under the imperfect LD 
model. The simulation procedure follows the one 
used in Zheng et al. (2009). We simulated geno- 
type counts for each of 300,000 SNPs, among which 
6 SNPs have true associations and D' = 0.8 with 
MAF of 0.2 at the functional loci. When D' = 1, 
the number of functional loci is also 6. However, 
when D' = 0.8, we assume the number of functional 
loci equals the number of different genetic models in 
the simulation. Zheng et al. (2009) considered the 
perfect LD model that corresponds to \D'\ = 1 or 
I Corr(^4,a)| = 1. Their results are repeated here for 
comparison. The MAFs of 6 true SNPs from the 
genetic models listed in the titles of Tables 5 and 
6 were 0.1821, 0.2943, 0.1078, 0.4459, 0.1620 and 
0.1825. These are also given in Zheng et al. (2009) 
and in Li et al. (2008b). MAFs for the rest of the 
null SNPs were simulated from a uniform distribu- 
tion [7(0.1,0.5). The GRRs for the functional loci 
were all 1.25 (or 1.50). We applied five robust tests 
(Z1/2, Pearson's test T % 2, GMS, MIN2 and MAX 3 ) 
to rank all SNPs and the top 5000 SNPs were se- 
lected from each of 200 replicates. The criteria to 
compare the performance of robust tests include the 
probability (prob %) of at least one true SNP be- 
ing selected among the top 5000 SNPs, the average 
number of true SNPs among the top, and the mean 
of the minimum ranks of the true SNPs among the 
top. The results are presented in Table 5 (2 REC, 
1 ADD, 1 MUL and 2 DOM SNPs) and Table 6 (1 
REC, 2 ADD, 2 MUL and 1 DOM SNPs). 
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Table 3 

Distributions of genetic models selected by the GMS using the HWDTT (%): Disease prevalence fc = 0.1, the GRR at the 
functional locus A2 = 2 with 250 cases and 250 controls and 10,000 replicates 



D'/selected models (A/M = ADD/MUL) 



MAF 
p = q 


True 
model 




1.0 






0.8 






0.6 




REC 


A/M 


DOM 


REC 


A/M 


DOM 


REC 


A/M 


DOM 


0.1 


REC 


23.3 


76.3 


0.4 


14.6 


84.4 


1.0 


3.0 


90.6 


6.4 




ADD 


2.6 


88.9 


8.5 


2.4 


90.2 


7.4 


2.9 


90.8 


6.3 




MUL 


3.4 


90.3 


6.3 


3.7 


91.1 


5.2 


3.8 


90.8 


5.4 




DOM 


0.1 


60.1 


39.8 


0.3 


76.1 


23.6 


1.0 


84.8 


14.2 


0.3 


REC 


67.5 


32.5 


0.0 


39.4 


60.4 


0.2 


18.6 


80.6 


0.8 




ADD 


2.2 


88.9 


8.9 


3.1 


89.3 


7.(i 


3.7 


89.6 


6.7 




MUL 


4.8 


90.7 


4.5 


5.0 


90.4 


4.6 


5.2 


90.1 


4.7 




DOM 


0.0 


32.8 


67.2 


0.1 


61.4 


38.5 


0.7 


80.2 


19.1 


0.5 


REC 


66.0 


34.0 


0.0 


36.8 


63.1 


0.2 


18.3 


80.9 


0.8 




ADD 


2.6 


89.0 


8.4 


3.3 


89.6 


7.1 


3.7 


90.8 


5.5 




MUL 


5.4 


89.9 


4.7 


5.0 


90.1 


4.9 


5.2 


89.9 


4.9 




DOM 


0.0 


36.2 


63.8 


0.1 


63.9 


36.0 


0.8 


81.2 


18.0 



First, when D' = 1 (Zheng et al., 2009), the GMS 
outperforms other tests under all three criteria, while 
Pearson's test had the worst performance. When 
D' = 0.8, however, the GMS and Zi/ 2 had simi- 
lar performances, which together outperform other 
tests using the three criteria. This finding is consis- 
tent to our results in Theorems 2.1 and 2.2 about 
the genetic models under the imperfect LD model. 

6. APPLICATIONS TO WTCCC DATA 

We apply the five robust tests to a genome-wide 
scan using more than 300,000 SNPs after quality 
control. The study was originally conducted by 
WTCCC (2007) for seven diseases (type 1 diabetes — 
T1D, type 2 diabetes — T2D, coronary heart disease — 
CHD, hypertension — HT, bipolar disorder — BD, 
rheumatoid arthritis — RA and Crohn's disease — CD). 
About 2000 cases were used for each disease and 
3000 controls were shared for the seven diseases. 
WTCCC (2007) used MIN2 to test for association 
after the quality control. They obtained two tables 
presenting SNPs with strong associations with 
MIN2 < 5 x 10~ 7 (Table 3 of WTCCC, 2007) and 
SNPs with moderate associations with 5 x 10~ 7 < 
MIN2 < 5 x 10" 5 (Table 4 of WTCCC, 2007). We 
reanalyze these data by ranking all SNPs after our 
quality control. The goal of this application is to 
demonstrate the efficiency robustness of different 
test statistics, not to find SNPs with associations 
that were not reported in WTCCC (2007). 



In our application, for each of the seven diseases, 
we rank all SNPs after quality control (398,092 SNPs) 
using the five robust tests and report the ranks of 
the SNPs that were reported to have strong asso- 
ciations in WTCCC (2007), Table 3. Note that we 
do not know D' in reality, nor do we know the num- 
ber of functional loci and their modes of inheritance. 
Our results are reported in Table 7. The results show 
that SNPs with strong associations are all ranked on 
the top 5000 SNPs. The CATT is least robust among 
the five robust tests as shown by the rank 269 for 
BD, while the ranks by the other methods are less 
than 25. The GMS tends to have smaller ranks than 
MAX3, and MIN2 tends to have ranks between the 
CATT and Pearson's test, which often have higher 
ranks than the GMS. 

We also studied the ranks of SNPs with moderate 
associations reported in WTCCC (2007), Table 4. 
The detailed results are not shown here, but summa- 
rized below. Similar patterns are also observed, al- 
though, for several SNPs, the CATT has large ranks. 
For example, for BD, the CATT has rank 147,769 for 
SNP rs6458307 on chromosome 6, while the ranks of 
other tests for this SNP are less than 150. For T2D, 
the CATT has rank 197,064 for SNP rs358806 on 
chromosome 3, while the other tests have ranks less 
than 100. All ranks of SNPs with either strong or 
moderate associations are less than 5000, and only 
one SNP (rsl7166496 for T1D on chromosome 5) 
is ranked more than 5000 by MAX 3 and the GMS. 
The actual ranks for this SNP are 5521 for the GMS 
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Table 5 

Genome-wide scans of 300,000 SNPs containing 6 true SNPs (2 REC, 1 ADD, 1 MUL and 2 DOM). Only the top 5000 
SNPs are selected. The results are based on 200 replicates: MAF q = 0.2 at the functional locus when D' = 0.8. Samples sizes 

arer = s= 1000 for GRR=1.25 andr = s = 500 for GRR=1.5 









D' = 1.0 






D' = 0.8 




GRR 


Robust 




Ave. no. of 


Mean of 




Ave. no. of 


Mean of 


A 2 


tests 


Prob 


true SNPs 


min ranks 


Prob 


true SNPs 


min ranks 


1.25 


Zl/2 


92.0 


1.79 


971 


58.5 


1.29 


1625 




GMS 


94.5 


1.90 


838 


56.0 


1.29 


1488 




MAX 3 


90.5 


1.80 


909 


48.0 


1.28 


1435 




MIN2 


89.5 


1.79 


934 


51.0 


1.25 


1550 




T x 2 


86.5 


1.69 


960 


46.5 


1.22 


1680 


1.50 


Zl/2 


99.5 


2.71 


186 


83.0 


1.49 


1041 




GMS 


100.0 


2.99 


178 


85.0 


1.54 


1111 




MAX 3 


99.5 


2.83 


205 


80.0 


1.48 


1183 




MIN2 


100.0 


2.78 


234 


80.0 


1.50 


1113 




7> 


100.0 


2.71 


286 


75.0 


1.46 


1244 



and 6063 for MAX 3 , 652 for Pearson's test, 724 
for MIN2, but 245,454 for the CATT. The under- 
lying genetic model for this SNP could be outside 
of the constrained genetic model that we considered 
here, for example, overdominant or underdominant 
for which it is known that Pearson's test is robust 
(Zheng, Joo and Yang, 2009; Joo et al., 2009). In 
addition, we found that for those SNPs with small 
ranks based on Pearson's test, a large rank using 
the CATT is always accompanied by a large value 
of the HWDTT. This is due to the orthogonal de- 
composition of Pearson's test to the HWDTT and 
Zy 2 (Zheng et al., 2008). It is also interesting to 
note that, even if a SNP has a rank smaller than 



those SNPs listed in Table 7, it does not mean the 
SNP has a true association with a disease. That is, 
in GWAS, a SNP with smaller p- value does not nec- 
essarily mean it has stronger association. In fact, 
many of these SNPs with smaller ranks have not 
been confirmed to have true associations (WTCCC, 
2007). This is because a very small number of SNPs 
(<100 SNPs) are associated with a disease in GWAS 
compared to the number of null SNPs (more than 
300,000 SNPs). Therefore, the probability that test 
statistics of some null SNPs are greater than those 
of all the associated SNPs is high (Zaykin and Zhiv- 
otovsky, 2005). 



Table 6 

Genome-wide scans of 300,000 SNPs containing 6 true SNPs (1 REC, 2 ADD, 2 MUL and 1 DOM). Only the top 5000 
SNPs are selected. The results are based on 200 replicates: MAF q = 0.2 at the functional locus and D' =0.8. Samples sizes 

arer = s = 1000 for GRR=1.25 andr = s = 500 for GRR=1.5 









D' = 1.0 






D' = 0.8 




GRR 


Robust 




Ave. no. of 


Mean of 




Ave. no. of 


Mean of 


A 2 


tests 


Prob 


true SNPs 


min ranks 


Prob 


true SNPs 


min ranks 


1.25 




88.0 


1.72 


897 


49.5 


1.31 


1564 




GMS 


87.0 


1.79 


797 


53.5 


1.27 


1630 




MAX 3 


82.5 


1.64 


846 


47.0 


1.24 


1702 




MIN2 


86.0 


1.66 


932 


48.5 


1.25 


1899 






83.0 


1.50 


1030 


41.5 


1.20 


1847 


1.50 


Zl/2 


99.0 


2.46 


349 


76.5 


1.48 


1083 




GMS 


99.5 


2.61 


355 


76.0 


1.47 


1005 




MAX 3 


98.0 


2.34 


379 


73.0 


1.40 


1103 




MIN2 


99.5 


2.35 


434 


74.0 


1.38 


1105 






97.0 


2.21 


485 


66.5 


1.31 


1179 
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Table 7 

Ranks of SNPs with strong association of seven diseases in WTCC'C (2007), Table 3 







(.111 Ul 11 


7, , 

■^1/2 




PMC 




MTN9 

1\ 111 N L 


BD 


rs420259 


16 


269 


22 


19 


20 


23 


CAD 


rsl333049 


9 


9 


25 


24 


24 


25 


V 1 J 


lallOUJOUO 


I 


14 


— o 




24 


24 




rsl0210302 


2 


6 


15 


15 


16 


15 




rs9858542 


3 


102 


58 


58 


61 


75 




rsl7234657 


5 


11 


25 


19 


20 


21 




rsl000113 


5 


72 


92 


78 


82 


84 




rsl0761659 


10 


89 


115 


100 


107 


101 




rsl0883365 


10 


50 


65 


59 


62 


61 




rsl7221417 


16 


25 


37 


35 


37 


38 




rs2542151 


18 


69 


84 


77 


80 


81 


RA 


rs6679677 


1 


50 


72 


71 


69 


70 




rs6457617 


6 


5 


13 


8 


8 


13 


T1D 


rs6679677 


1 


129 


137 


133 


136 


135 




rs9272346 


6 


3 


6 


5 


5 


5 




rslll71739 


12 


339 


361 


342 


357 


354 




rsl7696736 


12 


233 


245 


238 


243 


242 




rsl2708716 


16 


521 


534 


517 


534 


530 


T2D 


rs9465871 


6 


31 


41 


49 


44 


45 




rs4506565 


10 


10 


17 


17 


17 


16 




rs9939609 


16 


24 


38 


36 


36 


37 



7. DISCUSSION 

We studied some robust tests for case-control ge- 
netic association studies. This approach stems from 
the classical robust procedures studied in the 1970s 
which focused on the estimation of the location pa- 
rameter of a symmetric distribution. For a given 
family of underlying distributions (or, here, genetic 
models), an estimate with a high (low) minimum 
correlation, say, >0.80 (<0.50) with the optimal pro- 
cedure, indicates a greater (smaller) efficiency ro- 
bustness. In early work, the underlying distribution 
was assumed to range from the normal distribution 
to the Cauchy distribution (Tukey, 1965 and An- 
drews et al., 1965). For this family of ^distributions, 
the robust estimate of the location parameter was 
considered, because within the family of distribu- 
tions considered, it had minimum correlation with 
the optimal procedure of about 0.60 (Gastwirth, 
1966). In case-control genetic association studies, 
when the true genetic model is unknown and ranges 
from the REC to the DOM models, the minimum 
correlation of any two CATTs is about 0.30 (Frei- 
dlin et al., 2002). This indicates that using a single 
CATT for association is not robust, and tests that 



are robust across a family of plausible genetic mod- 
els are preferred. 

Previous studies of robustness properties of test 
statistics for the analysis of case-control genetic as- 
sociation studies have been focused on the perfect 
(or complete) LD model, that is, the genetic marker 
(SNP) is also the functional locus. In this article 
we studied genetic models under a general imper- 
fect (or incomplete) LD model with linkage disequi- 
librium between linked marker locus and functional 
locus. The perfect LD model is a special case. Un- 
der the imperfect LD model, we found that a ge- 
netic model defined by the genotype relative risks 
at the functional locus usually no longer remains 
the same genetic model at the marker locus, except 
for the additive or multiplicative models. The ge- 
netic model space at the marker locus is a subset 
of that at the functional locus, resulting in smaller 
genotype relative risks at the marker than at the 
functional locus. The power to detect a true associ- 
ation is reduced when the linkage disequilibrium de- 
creases, while the model uncertainty increases, com- 
plicating the choice of a single association statistic. 
Robust tests are shown to perform optimally in this 
situation. 
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We also review some common efficiency robust 
tests for case-control genetic associations and their 
usage in genome-wide scans. In genome-wide scans, 
all SNPs are ranked by a test statistic or its p- 
value (if the p- value is readily obtained) and the top- 
ranked SNPs are selected for further analyses. Alter- 
natively, as in WTCCC (2007), some genome- wide 
threshold levels can be also used to select SNPs. 
Multiple testing is an important issue in GWAS 
not only because one tests 300,000 up to a million 
SNPs, but also because multiple tests are available 
for each SNP (and there is no uniform most powerful 
test in GWAS). Correcting for multiple testing re- 
mains challenging in the analysis of GWAS (Roeder 
and Wasserman, 2009), and the need for indepen- 
dent replication studies (Kraft, Zeggini and Ioanni- 
dis, 2009) and proper meta-analysis (Pfeiffer, Gail 
and Pee, 2009) cannot be overemphasized. 
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